In this file, I will perform exploratory data analysis (EDA) and subsequent data wrangling to get the data into proper shape for downstream modeling.
From R4DS, chapter 7: Exploratory Data Analysis:
EDA is an iterative cycle. You:
- Generate questions about your data.
- Search for answers by visualising, transforming, and modelling your data.
- Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
First, we load necessary libraries.
library(tidyverse) # A whole bunch of packages necessary for data analysis
library(broom) # To effectively clean up various objects
library(ggforce) # Has some nice ggplot functionality
Load data from the previous step.
variables <- readRDS(file = "../data/processed/variables.rds")
annotation <- readRDS(file = "../data/processed/annotation.rds")
data_comb <- readRDS(file = "../data/processed/data_comb.rds")
tableone
Use tableone to get a quick feel for the population.
tableone::CreateTableOne(
data = data_comb %>% filter(time == "base"),
vars = variables$clinic,
strata = "group",
factorVars = variables$clinic_fac
)
## Stratified by group
## control intervention p test
## n 52 47
## age (mean (SD)) 55.19 (9.79) 53.62 (9.68) 0.424
## gender = man (%) 21 (40.4) 20 (42.6) 0.988
## hypertensive_med = yes (%) 2 ( 3.8) 3 ( 6.4) 0.908
## myocardialinfarction (%) 0.668
## no 39 (76.5) 36 (80.0)
## yes 9 (17.6) 8 (17.8)
## unknown 3 ( 5.9) 1 ( 2.2)
## smoking (%) 0.165
## regularly 8 (15.7) 2 ( 4.3)
## seldom 0 ( 0.0) 2 ( 4.3)
## quit 11 (21.6) 12 (25.5)
## never 31 (60.8) 31 (66.0)
## snuff 1 ( 2.0) 0 ( 0.0)
## smoking2 = yes (%) 9 (17.6) 4 ( 8.5) 0.301
## height (mean (SD)) 172.95 (9.12) 172.06 (9.16) 0.630
## weight (mean (SD)) 74.05 (13.14) 75.69 (12.66) 0.530
## bmi (mean (SD)) 24.63 (3.05) 25.44 (2.92) 0.180
## waist (mean (SD)) 88.61 (10.56) 90.53 (10.45) 0.365
## hip (mean (SD)) 102.00 (6.28) 102.21 (5.44) 0.859
## whratio (mean (SD)) 0.87 (0.07) 0.88 (0.08) 0.293
## sbp (mean (SD)) 127.56 (14.24) 128.23 (11.86) 0.799
## dbp (mean (SD)) 76.17 (9.18) 78.36 (8.10) 0.214
## pulse (mean (SD)) 63.52 (9.36) 64.45 (7.13) 0.583
## totalfatpercent (mean (SD)) 29.23 (8.33) 30.81 (7.73) 0.329
## totalfatweight (mean (SD)) 21.63 (7.46) 23.10 (6.43) 0.297
## fatfreeweight (mean (SD)) 52.43 (11.33) 52.56 (11.72) 0.958
## leuko (mean (SD)) 5.42 (1.40) 5.29 (1.35) 0.620
## neutro (mean (SD)) 2.88 (0.87) 2.84 (1.10) 0.879
## lymph (mean (SD)) 1.89 (0.56) 1.80 (0.51) 0.385
## mono (mean (SD)) 0.46 (0.14) 0.45 (0.12) 0.683
## eosino (mean (SD)) 0.15 (0.12) 0.17 (0.12) 0.451
## creatinine (mean (SD)) 75.48 (12.08) 74.26 (12.40) 0.620
## asat (mean (SD)) 23.65 (10.06) 21.89 (7.82) 0.337
## alat (mean (SD)) 27.17 (11.93) 27.17 (12.10) 0.999
## ggt (mean (SD)) 24.53 (29.20) 27.49 (22.10) 0.587
## alp (mean (SD)) 69.38 (21.15) 74.28 (16.73) 0.208
## hba1c (mean (SD)) 5.36 (0.23) 5.29 (0.25) 0.175
## glucose (mean (SD)) 5.23 (0.37) 5.29 (0.39) 0.458
## insulin (mean (SD)) 50.62 (28.18) 58.94 (27.54) 0.141
## tg (mean (SD)) 1.31 (0.60) 1.27 (0.49) 0.758
## tc (mean (SD)) 6.58 (0.82) 6.55 (0.82) 0.886
## hdlc (mean (SD)) 1.72 (0.43) 1.70 (0.45) 0.809
## ldlc (mean (SD)) 4.12 (0.64) 4.18 (0.64) 0.678
## ldlhdlratio (mean (SD)) 2.55 (0.74) 2.64 (0.79) 0.554
## apoa1 (mean (SD)) 1.62 (0.20) 1.59 (0.22) 0.510
## apob (mean (SD)) 1.31 (0.20) 1.30 (0.17) 0.724
## aporatio (mean (SD)) 0.82 (0.14) 0.83 (0.15) 0.811
## lpa (mean (SD)) 296.29 (283.24) 285.86 (319.50) 0.869
## crp (mean (SD)) 1.45 (1.26) 1.57 (1.57) 0.689
## tsh (mean (SD)) 2.03 (0.78) 1.84 (0.81) 0.238
## vitd (mean (SD)) 74.50 (18.80) 73.64 (21.22) 0.831
## vitd3 (mean (SD)) 74.50 (18.80) 69.36 (20.63) 0.198
The CreateTableOne function is part of the tableone package. It has a bunch of useful functionality. To learn more, check out the tableone vignette! (A vignette is a short tutorial that gives you a feeling of how you can use a specific package and its functions; they are often very useful when well-written.)
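To find a package's vignettes from the R console, two base utilities do the job (a minimal sketch; the exact vignette names depend on the installed tableone version):

```r
# List the vignettes that ship with the tableone package
vignette(package = "tableone")

# Or open the whole list in your browser
browseVignettes("tableone")
```

Both functions live in base R's utils package, so no extra installation is needed.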
dplyr
Alternatively, we can create our own table one. This gives us more flexibility in which summary statistics we present. To do this, we use the standard toolkit: dplyr and tidyr.
table_num <- data_comb %>%
# Remove the change; keep baseline and end-of-study only
filter(time != "delta") %>%
# Select the numerical clinical variables
select(time, group, variables$clinic_num) %>%
# Group by timepoint and group
group_by(time, group) %>%
# Initiate call to summarise
summarise_all(
# Use the funs selector to tell summarise what summary stats you want
.funs = funs(
# First add number of non-NAs; in other words, n for each variable
n = sum(!is.na(.)),
# Add a measure of that variable's skewness
skewness = e1071::skewness,
# Then add a few other specific functions
mean, sd, median, IQR),
# Remove NAs for all these functions
na.rm = TRUE
) %>%
# Since the output is useless in the current format,
# we start to clean up by pivoting
pivot_longer(cols = -one_of("time", "group"), names_to = "variable", values_to = "values") %>%
# Then split "variable" by the separator "_"
separate(col = variable, into = c("variable", "stat"), sep = "_") %>%
# And create a new column "key" by binding together "time", "group" and "stat" by that same separator "_"
unite(col = "key", time, group, stat) %>%
# Finally pivot wider again, using the newly formed "key" column
pivot_wider(names_from = "key", values_from = "values") %>%
# We might also want to clean this up a bit, into something more readable
# Start by rounding off numbers
mutate_at(
.vars = vars(matches("skewness|mean|sd|median|IQR")),
.funs = round,
digits = 2
) %>%
# Then create text strings of both the measure of centrality and variance
mutate(
# For baseline
`base_control, mean (SD)` = paste0(base_control_mean, " (", base_control_sd, ")"),
`base_control, median (IQR)` = paste0(base_control_median, " (", base_control_IQR, ")"),
`base_intervention, mean (SD)` = paste0(base_intervention_mean, " (", base_intervention_sd, ")"),
`base_intervention, median (IQR)` = paste0(base_intervention_median, " (", base_intervention_IQR, ")"),
# For end of study
`end_control, mean (SD)` = paste0(end_control_mean, " (", end_control_sd, ")"),
`end_control, median (IQR)` = paste0(end_control_median, " (", end_control_IQR, ")"),
`end_intervention, mean (SD)` = paste0(end_intervention_mean, " (", end_intervention_sd, ")"),
`end_intervention, median (IQR)` = paste0(end_intervention_median, " (", end_intervention_IQR, ")")
) %>%
select(variable, matches("_n|\\("))
table_num
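Note that `funs()` is deprecated in dplyr 0.8 and removed in later versions; with dplyr >= 1.0 the same summary step is written with `across()` and a named list of lambdas. A self-contained sketch on toy data (hypothetical values standing in for data_comb):

```r
library(dplyr)

# Toy data standing in for data_comb (hypothetical values)
toy <- data.frame(
  group = rep(c("control", "intervention"), each = 4),
  bmi   = c(24, 25, NA, 26, 25, 27, 26, 28)
)

toy %>%
  group_by(group) %>%
  summarise(
    across(bmi, list(
      n    = ~sum(!is.na(.x)),
      mean = ~mean(.x, na.rm = TRUE),
      sd   = ~sd(.x, na.rm = TRUE)
    )),
    .groups = "drop"
  )
# Columns come out as bmi_n, bmi_mean, bmi_sd -- the same "_" naming
# that the separate() step in the pipeline above relies on.
```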
And that’s it, for the numerical variables. Now off to the categorical variables, or as R knows them: the factors.
table_fac <- data_comb %>%
# Remove the change; keep baseline and end-of-study only
filter(time != "delta") %>%
# Select the numerical clinical variables
select(time, group, variables$clinic_fac) %>%
# Pivot the variables to longer format
pivot_longer(cols = -one_of("time", "group"), names_to = "variable", values_to = "value") %>%
# Group by time, group and variable
group_by(time, group, variable) %>%
# Count occurrences within each grouping
count(value) %>%
# Add percentage within each grouping
mutate(perc = round((n / sum(n)) * 100, digits = 1)) %>%
# Pivot those values longer too
pivot_longer(cols = n:perc, names_to = "names", values_to = "value2") %>%
# Unite the grouping keys
unite(col = "key", time, group, names) %>%
# Pivot wider to a format that's easy to read
pivot_wider(names_from = "key", values_from = "value2") %>%
# Add the final summary column for each timepoint and group
mutate(
# For baseline
`base_control, n (%)` = paste0(base_control_n, " (", base_control_perc, ")"),
`base_intervention, n (%)` = paste0(base_intervention_n, " (", base_intervention_perc, ")"),
# For end of study
`end_control, n (%)` = paste0(end_control_n, " (", end_control_perc, ")"),
`end_intervention, n (%)` = paste0(end_intervention_n, " (", end_intervention_perc, ")")
) %>%
# Select only the final summaries
select(variable, value, matches("%"))
table_fac
It’s a bit long, but it works fine.
We could add a simple test at the end, just to see if there are any differences between the groups at these two timepoints. Let’s do that with a t-test.
This is how we do it for a single variable at baseline.
data_comb %>%
filter(time == "base") %>%
t.test(sbp ~ group, data = .) %>%
broom::tidy() %>%
pull("p.value")
## [1] 0.7972039
Now let’s scale up and do it for many variables at both timepoints.
table_num_p <- data_comb %>%
filter(time != "delta") %>%
select(time, group, variables$clinic_num) %>%
pivot_longer(cols = variables$clinic_num, names_to = "variable") %>%
group_by(time, variable) %>%
nest() %>%
mutate(
p.value = map_dbl(data,
~t.test(value ~ group, data = .x) %>%
broom::tidy() %>%
pull("p.value"))
) %>%
select(-data) %>%
pivot_wider(names_from = "time", values_from = "p.value") %>%
select(variable, base_p.value = base, end_p.value = end)
table_num_p
For the factors, we use the Chi-Squared test. This is how we do it for one variable.
x_gender <- data_comb %>% filter(time == "base") %>% pull("gender")
y_group <- data_comb %>% filter(time == "base") %>% pull("group")
chisq.test(x_gender, y_group) %>%
broom::tidy() %>%
pull("p.value")
## [1] 0.9884747
And this is how we scale up.
table_fac_p <- data_comb %>%
filter(time != "delta") %>%
select(time, group, variables$clinic_fac) %>%
pivot_longer(cols = variables$clinic_fac, names_to = "variable") %>%
group_by(time, variable) %>%
nest() %>%
mutate(
p.value = map_dbl(data,
~chisq.test(x = .x$value, y = .x$group) %>%
broom::tidy() %>%
pull("p.value"))
) %>%
select(-data) %>%
pivot_wider(names_from = "time", values_from = "p.value") %>%
select(variable, base_p.value = base, end_p.value = end)
table_fac_p
And finally, all these data can be saved to an Excel file so that you can prep it according to journal standards.
list(
"numerical" = table_num,
"numerical_p" = table_num_p,
"factor" = table_fac,
"factor_p" = table_fac_p
) %>%
openxlsx::write.xlsx(file = "../results/tables/table-clinic.xlsx", colWidths = "auto")
Let’s look at BMI using a histogram.
data_comb %>%
ggplot(aes(bmi, fill = group)) +
geom_histogram() +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~time, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Is this better displayed with a density plot?
data_comb %>%
ggplot(aes(bmi, color = group)) +
geom_density() +
scale_color_brewer(palette = "Set1") +
facet_wrap(~time, scales = "free")
Perhaps - let’s do a boxplot instead.
data_comb %>%
ggplot(aes(group, bmi, color = group)) +
geom_boxplot() +
scale_color_brewer(palette = "Set1") +
facet_wrap(~time, scales = "free")
I like the boxplots. But some information is redundant now. Remove the colors from the boxes, and remove the legend. Also, add jittered points on top. Those we can color.
data_comb %>%
ggplot(aes(group, bmi)) +
geom_boxplot(outlier.alpha = 0) +
geom_point(aes(color = group), position = position_jitter(), show.legend = FALSE) +
scale_color_brewer(palette = "Set1") +
facet_wrap(~time, scales = "free")
Clean up the theme a bit.
data_comb %>%
mutate(time = case_when(
time == "base" ~ "Baseline",
time == "end" ~ "End of study",
time == "delta" ~ "Change") %>%
fct_relevel("Baseline", "End of study")) %>%
ggplot(aes(group, bmi)) +
geom_boxplot(outlier.alpha = 0) +
geom_point(aes(color = group), position = position_jitter(), show.legend = FALSE) +
scale_color_brewer(palette = "Set1") +
facet_wrap(~time, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
labs(x = "Group", y = "BMI (kg/m2)")
data_comb %>%
filter(time != "delta") %>%
select(time, group, variables$clinic_fac) %>%
pivot_longer(cols = -one_of("time", "group")) %>%
group_by(time, group, name) %>%
count(value) %>%
mutate(perc = n/sum(n) * 100) %>%
ggplot(aes(value, perc, fill = group)) +
geom_col(position = position_dodge()) +
scale_fill_brewer(palette = "Set1") +
coord_flip() +
facet_wrap(~name, scales = "free") +
labs(x = NULL, y = "Percentage (%)")
summary_ldlc <- data_comb %>%
filter(time %in% c("base", "end")) %>%
group_by(time, group) %>%
summarize(y = mean_cl_normal(ldlc)[[1]],
ymin = mean_cl_normal(ldlc)[[2]],
ymax = mean_cl_normal(ldlc)[[3]])
data_comb %>%
filter(time %in% c("base", "end")) %>%
ggplot(aes(time, ldlc)) +
geom_line(aes(group = id), color = "grey") +
# I can either add the manually calculated summary stats
geom_line(data = summary_ldlc, aes(y = y, color = group, group = group)) +
geom_pointrange(data = summary_ldlc, aes(y = y, ymin = ymin, ymax = ymax, color = group)) +
# Or use a pair of simple statistical summary functions
stat_summary(aes(color = group, group = group), fun.y = "mean", geom = "line") +
stat_summary(aes(color = group, group = group), fun.data = "mean_cl_normal") +
scale_color_brewer(name = NULL, palette = "Set1") +
facet_wrap(~group) +
theme_classic() +
theme(strip.background = element_blank()) +
labs(x = NULL, y = "LDL-C (mmol/L)")
Or by a simple waterfall chart.
data_comb %>%
filter(time == "delta") %>%
mutate(id.row = row_number() %>% factor()) %>%
ggplot(aes(fct_reorder(id.row, ldlc), ldlc)) +
geom_col(aes(fill = group)) +
scale_fill_brewer(name = NULL, palette = "Set1") +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(x = NULL, y = "LDL-C (mmol/L)")
data_comb %>%
filter(time %in% c("base", "end")) %>%
select(id, time, group, height:vitd3) %>%
pivot_longer(cols = -c(id:group), names_to = "variables", values_to = "value") %>%
left_join(annotation$clinic, by = c("variables" = "name.short")) %>%
mutate(name.unit = paste0(name.full, " (", unit, ")")) %>%
ggplot(aes(time, value)) +
geom_line(aes(group = id), color = "grey", size = 0.2) +
stat_summary(aes(color = group, group = group), fun.y = "mean", geom = "line",
position = position_dodge(width = 0.1)) +
stat_summary(aes(color = group, group = group), fun.data = "mean_cl_normal",
position = position_dodge(width = 0.1)) +
scale_color_brewer(name = NULL, palette = "Set1") +
facet_wrap(~ name.unit, ncol = 4, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
labs(x = NULL, y = NULL)
data_comb %>%
filter(time == "delta") %>%
select(id, time, group, height:vitd3) %>%
pivot_longer(cols = -c(id:group), names_to = "variables", values_to = "value") %>%
filter(variables != "height") %>%
left_join(annotation$clinic, by = c("variables" = "name.short")) %>%
arrange(desc(value)) %>%
mutate(name.unit = paste0(name.full, " (", unit, ")"),
id.row = row_number() %>% factor()) %>%
ggplot(aes(fct_reorder(id.row, value), value, fill = group)) +
geom_col() +
scale_fill_brewer(name = NULL, palette = "Set1") +
facet_wrap(~name.unit, ncol = 4, scales = "free") +
theme_classic() +
theme(strip.background = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(x = NULL, y = NULL)
data_comb %>%
filter(time == "base") %>%
select(variables$clinic_num) %>%
cor(use = "pairwise.complete.obs") %>%
as.data.frame() %>%
rownames_to_column(var = ".rownames") %>%
pivot_longer(-.rownames, names_to = ".colnames", values_to = "r") %>%
mutate(r = if_else(r == 1, true = NA_real_, false = r)) %>%
left_join(annotation$clinic, by = c(".rownames" = "name.short")) %>%
left_join(annotation$clinic, by = c(".colnames" = "name.short")) %>%
ggplot(aes(
x = fct_reorder(name.full.x, order.variable.x),
y = fct_reorder(name.full.y, -order.variable.y))) +
geom_tile(aes(fill = r), color = "grey90") +
scale_fill_distiller(name = "Correlation \ncoefficient", palette = "RdBu", na.value = "grey70") +
facet_grid(cols = vars(fct_reorder(facet.group.x, facet.order.x)),
rows = vars(fct_reorder(facet.group.y, facet.order.y)),
scales = "free", space = "free",
labeller = label_wrap_gen(width = 10)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.y = element_text(angle = 0)) +
labs(x = NULL, y = NULL)
Let’s plot the variation in vitamin D over the course of the year. We assume that there will be a cyclic variation between summer and winter time.
data_comb %>%
select(id, time, date, vitd) %>%
filter(time %in% c("base", "end")) %>%
ggplot(aes(date, vitd)) +
geom_line(aes(group = id), color = "grey80") +
geom_point(color = "grey60") +
geom_smooth(color = "black") +
scale_color_brewer(palette = "Set1") +
scale_x_date(date_labels = "%b %Y") +
theme_classic() +
labs(x = NULL, y = "Vitamin D (nmol/L)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And indeed there is! What about all the other variables?
data_comb %>%
select(id, time, date, variables$clinic_num) %>%
filter(time %in% c("base", "end")) %>%
pivot_longer(cols = variables$clinic_num) %>%
left_join(annotation$clinic, by = c("name" = "name.short")) %>%
ggplot(aes(date, value)) +
geom_line(aes(group = id), color = "grey80") +
geom_point(color = "grey60") +
geom_smooth(color = "black") +
scale_color_brewer(palette = "Set1") +
scale_x_date(date_labels = "%b %Y") +
facet_wrap(~ name.full, ncol = 4, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
labs(x = NULL, y = NULL)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Run PCA for clinical variables at baseline.
pca_clinic <- data_comb %>%
filter(time == "base") %>%
select(id, variables$clinic) %>%
as.data.frame() %>%
column_to_rownames(var = "id") %>%
select_if(is.numeric) %>%
drop_na() %>%
prcomp(center = TRUE, scale = TRUE)
# From the broom website: '"u", "samples", "scores", or "x": returns information about the map from the original space into principal components space.'
tidy(pca_clinic, matrix = "samples")
# From the broom website: '"v", "rotation", "loadings" or "variables": returns information about the map from principal components space back into the original space.'
tidy(pca_clinic, matrix = "variables")
# From the broom website: '"d" or "pcs": returns information about the eigenvalues.'
tidy(pca_clinic, matrix = "pcs")
See the broom website for more information.
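If you want to try these accessors on something self-contained, a toy PCA on the built-in mtcars data behaves the same way (column choice here is arbitrary):

```r
library(broom)

# PCA on four numeric columns of the built-in mtcars data
pca_demo <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")],
                   center = TRUE, scale. = TRUE)

tidy(pca_demo, matrix = "samples")    # scores: one row per car x PC
tidy(pca_demo, matrix = "variables")  # loadings: one row per variable x PC
tidy(pca_demo, matrix = "pcs")        # std.dev, percent, cumulative per PC
```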
Let’s plot a few of the standard PCA plots:
tidy(pca_clinic, matrix = "pcs") %>%
pivot_longer(cols = c("std.dev", "percent", "cumulative")) %>%
ggplot(aes(PC, value)) +
geom_line() +
facet_wrap(~ name, scales = "free") +
theme_minimal()
tidy(pca_clinic, matrix = "samples") %>%
mutate(PC = paste0("PC", PC),
row = as.character(row) %>% as.numeric()) %>%
pivot_wider(names_from = "PC", values_from = "value") %>%
left_join(select(data_comb, id, group), by = c("row" = "id")) %>%
ggplot(aes(PC1, PC2, color = group)) +
geom_hline(yintercept = 0, color = "grey", linetype = "dashed") +
geom_vline(xintercept = 0, color = "grey", linetype = "dashed") +
geom_point() +
stat_ellipse(linetype = "dashed") +
scale_color_brewer(name = NULL, palette = "Set1") +
theme_classic()
tidy(pca_clinic, matrix = "variables") %>%
mutate(PC = paste0("PC", PC)) %>%
pivot_wider(names_from = "PC", values_from = "value") %>%
left_join(annotation$clinic, by = c("column" = "name.short")) %>%
ggplot(aes(PC1, PC2)) +
geom_hline(yintercept = 0, color = "grey", linetype = "dashed") +
geom_vline(xintercept = 0, color = "grey", linetype = "dashed") +
geom_text(aes(label = column, color = facet.group)) +
scale_color_brewer(name = NULL, palette = "Dark2") +
theme_classic()
Add clustering figures??
An overview heatmap.
data_comb %>%
filter(time == "base") %>%
select(variables$nightingale) %>%
cor(use = "pairwise.complete.obs") %>%
pheatmap::pheatmap(show_rownames = FALSE, show_colnames = FALSE)
Or we can ask specific questions:
data_comb %>%
filter(time == "base") %>%
ggplot(aes(.panel_x, .panel_y)) +
geom_point() +
geom_smooth(se = FALSE) +
facet_matrix(cols = vars(tc, tg, ldlc, hdlc, apob, apoa1),
rows = vars(`Serum-C`, `Serum-TG`, `LDL-C`, `HDL-C`, ApoB, ApoA1)) +
theme_bw() +
labs(x = "Clinical measurements", y = "Nightingale measurements")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
data_comb %>%
filter(time == "base") %>%
ggplot(aes(.panel_x, .panel_y)) +
geom_point() +
geom_smooth(se = FALSE) +
facet_matrix(cols = vars(matches("-P$")),
rows = vars(matches("-L$"))) +
theme_bw() +
theme(strip.text.y = element_text(angle = 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
data_night_cor <- data_comb %>%
filter(time == "base") %>%
select(variables$nightingale) %>%
cor(use = "pairwise.complete.obs") %>%
as.data.frame() %>%
rownames_to_column(var = "var1") %>%
pivot_longer(cols = -var1, names_to = "var2", values_to = "r")
Let’s look at the distribution of correlation coefficients for all variable combinations.
data_night_cor %>%
ggplot(aes(r)) +
geom_histogram(bins = 100)
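The spike at exactly 1 can be reproduced with a toy correlation matrix in base R, which also shows the masking trick used for the clinical heatmap earlier:

```r
set.seed(1)
# Three random, unrelated variables
m <- matrix(rnorm(30), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
cm <- cor(m)

diag(cm)           # all 1: each variable correlates perfectly with itself
cm[cm == 1] <- NA  # mask the self-correlations before plotting
```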
Those that are exactly 1 are, of course, each variable's correlation with itself. Let's split by type of variable.
data_night_cor %>%
left_join(annotation$nightingale, by = c("var1" = "name.short")) %>%
left_join(annotation$nightingale, by = c("var2" = "name.short"), suffix = c("_var1", "_var2")) %>%
filter(unit_var1 %in% c("%", "mmol/l"), unit_var2 %in% c("%", "mmol/l")) %>%
ggplot(aes(r)) +
geom_histogram() +
facet_grid(rows = vars(unit_var1), cols = vars(unit_var2)) +
labs(x = "Correlation coefficient", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
data_night_cor_particle <- data_night_cor %>%
filter(str_detect(var1, "-P$")) %>%
left_join(select(annotation$nightingale, name.short, size), by = c("var2" = "name.short")) %>%
filter(!is.na(size)) %>% select(-size) %>%
mutate(var1 = str_replace_all(var1, "-P", ""))
data_night_cor_lipids <- data_night_cor %>%
filter(str_detect(var1, "-L$")) %>%
left_join(select(annotation$nightingale, name.short, size), by = c("var2" = "name.short")) %>%
filter(!is.na(size)) %>% select(-size) %>%
mutate(var1 = str_replace_all(var1, "-L", ""))
data_night_cor_comb <- left_join(
data_night_cor_particle,
data_night_cor_lipids,
by = c("var1", "var2"),
suffix = c("_particles", "_lipids")
)
data_night_cor_comb %>%
ggplot(aes(r_particles, r_lipids)) +
geom_point() +
theme_classic() +
labs(x = "Correlation between 'Particle concentration' and lipid species",
y = "Correlation between 'Total lipids' and lipid species")
data_night_cor_species <- data_night_cor %>%
left_join(annotation$nightingale, by = c("var1" = "name.short")) %>%
left_join(annotation$nightingale, by = c("var2" = "name.short"), suffix = c("_var1", "_var2")) %>%
filter(!is.na(size_var1), !is.na(size_var2),
str_detect(var1, "-P$|-L$", negate = TRUE),
str_detect(var2, "-P$|-L$", negate = TRUE)
) %>%
mutate(var1 = str_replace_all(var1, "_%", ""),
var2 = str_replace_all(var2, "_%", ""),
type_var1 = abbreviate(type_var1, minlength = 2L) %>% str_to_upper(),
type_var2 = abbreviate(type_var2, minlength = 2L) %>% str_to_upper())
data_night_cor_species %>%
ggplot(aes(x = fct_reorder(var1, name.order_var1),
y = fct_reorder(var2, -name.order_var2), fill = r)) +
geom_tile() +
scale_fill_gradient2() +
facet_grid(cols = vars(fct_rev(unit_var1), type_var1),
rows = vars(fct_rev(unit_var2), type_var2),
scales = "free", space = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8)) +
labs(x = NULL, y = NULL)
Clearly, lipid species associate strongly with each other for absolute values, but not for relative values (%).
Run PCA for all Nightingale variables at baseline and end-of-study.
# Baseline
pca_nightingale_base <- data_comb %>%
filter(time == "base") %>%
select(id, variables$nightingale) %>%
as.data.frame() %>%
column_to_rownames(var = "id") %>%
select_if(is.numeric) %>%
drop_na() %>%
prcomp(center = TRUE, scale = TRUE)
# End-of-study
pca_nightingale_end <- data_comb %>%
filter(time == "end") %>%
select(id, variables$nightingale) %>%
as.data.frame() %>%
column_to_rownames(var = "id") %>%
select_if(is.numeric) %>%
drop_na() %>%
prcomp(center = TRUE, scale = TRUE)
Again, see the broom website for more information.
Let’s plot a few of the standard PCA plots:
There is little point in plotting the loadings ("variables") here: with this many variables, the result would be hopelessly overplotted.
tidy(pca_nightingale_base, matrix = "pcs") %>%
pivot_longer(cols = c("std.dev", "percent", "cumulative")) %>%
ggplot(aes(PC, value)) +
geom_line() +
facet_wrap(~ name, scales = "free") +
theme_minimal()
pca_nightingale_scores_prep <- bind_rows(
tidy(pca_nightingale_base, matrix = "samples") %>%
mutate(PC = paste0("PC", PC),
row = as.character(row) %>% as.numeric(),
time = "base") %>%
filter(PC %in% c("PC1", "PC2")) %>%
pivot_wider(names_from = "PC", values_from = "value"),
tidy(pca_nightingale_end, matrix = "samples") %>%
mutate(PC = paste0("PC", PC),
row = as.character(row) %>% as.numeric(),
time = "end") %>%
filter(PC %in% c("PC1", "PC2")) %>%
pivot_wider(names_from = "PC", values_from = "value")) %>%
left_join(select(data_comb, id, group, time), by = c("row" = "id", "time"))
pca_nightingale_scores_prep %>%
ggplot(aes(PC1, PC2)) +
geom_hline(yintercept = 0, color = "grey", linetype = "dashed") +
geom_vline(xintercept = 0, color = "grey", linetype = "dashed") +
geom_point(aes(color = time)) +
stat_ellipse(aes(color = time), linetype = "dashed") +
scale_color_brewer(name = NULL, palette = "Dark2") +
facet_wrap(~ group) +
theme_classic() +
theme(strip.background = element_blank(),
legend.position = "top")
Here I can add something about clustering techniques.
This concludes the script. Some conclusions and take-homes:
To improve reproducibility, print out the session info for this script.
devtools::session_info()
## - Session info ----------------------------------------------------------
##
## - Packages --------------------------------------------------------------
## package * version date lib source
## acepack 1.4.1 2016-10-29 [1] CRAN (R 3.6.0)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.4 2019-04-10 [1] CRAN (R 3.6.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.6.0)
## broom * 0.5.2 2019-04-07 [1] CRAN (R 3.6.0)
## callr 3.3.0 2019-07-04 [1] CRAN (R 3.6.1)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
## checkmate 1.9.4 2019-07-04 [1] CRAN (R 3.6.1)
## class 7.3-15 2019-01-01 [2] CRAN (R 3.6.0)
## cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
## cluster 2.0.8 2019-04-05 [2] CRAN (R 3.6.0)
## colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## data.table 1.12.2 2019-04-07 [1] CRAN (R 3.6.0)
## DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.0)
## desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
## devtools 2.1.0 2019-07-06 [1] CRAN (R 3.6.1)
## digest 0.6.20 2019-07-04 [1] CRAN (R 3.6.1)
## dplyr * 0.8.2 2019-06-29 [1] CRAN (R 3.6.0)
## e1071 1.7-2 2019-06-05 [1] CRAN (R 3.6.0)
## ellipsis 0.2.0.1 2019-07-02 [1] CRAN (R 3.6.1)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.0 2018-10-05 [1] CRAN (R 3.6.0)
## farver 1.1.0 2018-11-20 [1] CRAN (R 3.6.1)
## forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.6.0)
## foreign 0.8-71 2018-07-20 [2] CRAN (R 3.6.0)
## Formula 1.2-3 2018-05-03 [1] CRAN (R 3.6.0)
## fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
## ggforce * 0.3.1 2019-08-20 [1] CRAN (R 3.6.1)
## ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.1)
## ggrepel 0.8.1 2019-05-07 [1] CRAN (R 3.6.1)
## glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
## gridExtra * 2.3 2017-09-09 [1] CRAN (R 3.6.1)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
## haven * 2.1.1 2019-07-04 [1] CRAN (R 3.6.1)
## highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
## Hmisc 4.2-0 2019-01-26 [1] CRAN (R 3.6.0)
## hms 0.5.0 2019-07-09 [1] CRAN (R 3.6.0)
## htmlTable 1.13.1 2019-01-07 [1] CRAN (R 3.6.0)
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.6.0)
## htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.6.0)
## httr 1.4.0 2018-12-11 [1] CRAN (R 3.6.0)
## inline 0.3.15 2018-05-18 [1] CRAN (R 3.6.1)
## jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
## knitr * 1.23 2019-05-18 [1] CRAN (R 3.6.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
## labelled 2.2.1 2019-05-26 [1] CRAN (R 3.6.1)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.0)
## latticeExtra 0.6-28 2016-02-09 [1] CRAN (R 3.6.0)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.0)
## lifecycle 0.1.0 2019-08-01 [1] CRAN (R 3.6.1)
## loo 2.1.0 2019-03-13 [1] CRAN (R 3.6.1)
## lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## MASS 7.3-51.4 2019-03-31 [2] CRAN (R 3.6.0)
## Matrix 1.2-17 2019-03-22 [2] CRAN (R 3.6.0)
## matrixStats 0.54.0 2018-07-23 [1] CRAN (R 3.6.1)
## memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
## mitools 2.4 2019-04-26 [1] CRAN (R 3.6.1)
## modelr 0.1.4 2019-02-18 [1] CRAN (R 3.6.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
## nlme 3.1-139 2019-04-09 [2] CRAN (R 3.6.0)
## nnet 7.3-12 2016-02-02 [2] CRAN (R 3.6.0)
## openxlsx * 4.1.2 2019-10-29 [1] CRAN (R 3.6.1)
## packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.1)
## pheatmap 1.0.12 2019-01-04 [1] CRAN (R 3.6.0)
## pillar 1.4.2 2019-06-29 [1] CRAN (R 3.6.0)
## pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.6.0)
## pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.6.0)
## polyclip 1.10-0 2019-03-14 [1] CRAN (R 3.6.0)
## prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
## processx 3.4.0 2019-07-03 [1] CRAN (R 3.6.1)
## ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
## purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.6.0)
## R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.0)
## RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.6.0)
## Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.6.0)
## readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
## readxl * 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
## remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
## reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.6.0)
## rlang 0.4.0 2019-06-25 [1] CRAN (R 3.6.0)
## rmarkdown * 1.13 2019-05-22 [1] CRAN (R 3.6.0)
## rpart 4.1-15 2019-04-12 [2] CRAN (R 3.6.0)
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
## rstan 2.19.2 2019-07-09 [1] CRAN (R 3.6.1)
## rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
## rvest 0.3.4 2019-05-15 [1] CRAN (R 3.6.0)
## scales 1.0.0 2018-08-09 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## StanHeaders 2.19.0 2019-09-07 [1] CRAN (R 3.6.1)
## stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## survey 3.36 2019-04-27 [1] CRAN (R 3.6.1)
## survival 2.44-1.1 2019-04-01 [2] CRAN (R 3.6.0)
## tableone 0.10.0 2019-02-17 [1] CRAN (R 3.6.1)
## testthat 2.1.1 2019-04-23 [1] CRAN (R 3.6.1)
## tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
## tidyr * 1.0.0 2019-09-11 [1] CRAN (R 3.6.1)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0)
## tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.6.1)
## tinytex 0.14 2019-06-25 [1] CRAN (R 3.6.0)
## tweenr 1.0.1 2018-12-14 [1] CRAN (R 3.6.1)
## usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.0)
## vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.6.1)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
## xfun 0.8 2019-06-25 [1] CRAN (R 3.6.0)
## xml2 1.2.0 2018-01-24 [1] CRAN (R 3.6.0)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
## zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.0)
## zip 2.0.4 2019-09-01 [1] CRAN (R 3.6.1)
## zoo 1.8-6 2019-05-28 [1] CRAN (R 3.6.1)
##
## [1] C:/Users/jacobjc/Documents/R/win-library/3.6
## [2] C:/Program Files/R/R-3.6.0/library